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ABSTRACT 

We address the problem of performing semantic transfor- 
mations on strings, which may represent a variety of data 
types (or their combination) such as a column in a relational 

table, time, date, currency, etc. Unlike syntactic transfor- 
mations, which are based on regular expressions and which 
interpret a string as a sequence of characters, semantic trans- 
formations additionally require exploiting the semantics of 
the data type represented by the string, which may be en- 
coded as a database of relational tables. Manually perform- 
ing such transformations on a large collection of strings is 
error prone and cumbersome, while programmatic solutions 
are beyond the skill-set of end- users. We present a pro- 
gramming by example technology that allows end-users to 
automate such repetitive tasks. 

We describe an expressive transformation language for 
semantic manipulation that combines table lookup opera- 
tions and syntactic manipulations. We then present a syn- 
thesis algorithm that can learn all transformations in the 
language that are consistent with the user-provided set of 
input-output examples. We have implemented this technol- 
ogj' as an add-in for the Microsoft Excel Spreadsheet system 
and have evaluated it successfully over several benchmarks 
picked from various Excel help-forums. 

1. INTRODUCTION 

The IT revolution over the past few decades has resulted 
in two significant advances: digitization of massive amounts 
of data and accessibility of computational devices to massive 

proportions of the population. It is thus not surprising that 
more than 500 million people worldwide use spreadsheets for 
storing and manipulating data. These business end-users 
have myriad diverse backgrounds and include commodity 
traders, graphic designers, chemists, human resource man- 
agers, finance pros, marketing managers, underwriters, com- 
pliance officers, and even mail room clerks they are not 
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professional programmers, but they need to create small, 
often one-off, applications to support business functions [6]. 

Unfortunately, the programming experience since incep- 
tion has mostly focused on serving the needs of a select class 
of few million skilled users. In particular, spreadsheet sys- 
tems like Microsoft Excel allow sophisticated users to write 
macros using a rich inbuilt library of string and numerical 
functions, or to write arbitrary scripts using a variety of 
programming languages like Visual Basic, or .NET. Since 
end-users are not proficient in programming, they find it too 
difficult to write desired macros or scripts. 

The combination of the above-mentioned technical trends 
and lack of a satisfactory solution has led to a marketplace 
of hundreds of advertisement-driven help- forums^ , some of 
which contain millions of posts from end-users soliciting help 
for scripts to manipulate data in their spreadsheets. The ex- 
perts respond to these requests after some time. From an 
extensive case study of such spreadsheet help-forums, we ob- 
served the following two things. 

Semantic String Transformations: Several of the re- 
quested scripts/macros were for manipulating strings that 
need to be interpreted as more than a sequence of charac- 
ters, e.g., as a column entry from some relational table, or 
as some standard data type such as date, time, or currency. 
(See §2 for a motivating example.) In this paper, we de- 
scribe the systematic design of a semantic transformation 
language for manipulating such strings. 
Input-Output Examples based Interaction Model: 
End-users used input-output examples as the most com- 
mon and natural way of expressing intent to experts on the 
other side of the help-forums. An expert provides a pro- 
gram/transformation that is consistent with those examples. 
If the end-user is not happy with the result of the program 
on any other new input in the spreadsheet, the interaction 
is repeated with an extended set of input-output examples. 
This is the natural interface that our tool provides to the 
end-user. We describe the systematic design of an (induc- 
tive) synthesis algorithm that can learn desired scripts in 
our transformation language from very few examples. 

We observe that most semantic transformations can be ex- 
pressed as a combination of lookup transformations and syn- 
tactic transformations. We use this observation to present 
a systematic design of the transformation language for per- 
forming semantic string transformations. We first present 
an expressive language for lookup transformations and then 
extend it by adding syntactic transformations [8]. 

^http://wvsrw.excelforum.com/, http://www.ozgrid.com/forum/, 
http://www.mrexcel.com/, http://wvsrw.exceltip.com/ 
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Input VI 


Input V2 


Output 


Stroller 


10/12/2010 


$145.67+0.30*145.67 


Bib 


23/12/2010 


$3.56+0.45*3.56 


Diapers 


21/1/2011 


$21.45+0.35*21.45 


Wipes 


2/4/2009 


$5.12+0.40*5.12 


Aspirator 


23/2/2010 


$2.56+0.30*2.56 



MarkupRec 


Id 


Name 


Markup 


S30 


Stroller 


30% 


B56 


Bib 


45% 


D32 


Diapers 


35% 


W98 


Wipes 


40% 


A46 


Aspirator 


30% 



CostRec 


Id 


Date 


Price 


S30 


12/2010 


$145.67 


S30 


11/2010 


$142.38 


B56 


12/2010 


$3.56 


D32 


1/2011 


$21.45 


W98 


4/2009 


$5.12 


A46 


2/2010 


$2.56 



Figure 1: A transformation that requires perfoming 
syntactic manipulations on multiple lookup results. 

We also describe a systematic design of the synthesis al- 
gorithm for the semantic string transformation language, 
which can synthesize a set of semantic transformations that 
are consistent with the given set of input-output examples. 
We first describe a synthesis algorithm for the lookup trans- 
formation language Lt, and then extend it to a synthesis 
algorithm for the extension of Lt with syntactic transfor- 
mations. Experimental results on our benchmark examples 
show that our algorithm is scalable and can learn desired 
transformations from very few examples. 

This paper makes the following key contributions: 

• We describe a lookup transformation language Lt and an 
inductive synthesis algorithm for it (§4). 

• We observe that most semantic transformations can be 
expressed as combination of lookup and syntactic trans- 
formations. We extend language Lt with syntactic trans- 
formations to obtain a very expressive language L„ and 
describe an inductive synthesis algorithm for it (§5). 

• We describe an experimental prototype of our system 
that has an attractive user-interface and is ready to be 
deployed. We present the evaluation of our prototype 
on a large collection of benchmarks obtained from help- 
forums, books, mailing lists and Excel product team (§7). 

This paper is organized as follows. §3 establishes a com- 
mon framework for describing transformation languages and 
their inductive synthesis algorithms. §4 and §5 describe 
novel instantiations of this framework for a lookup trans- 
formation language and for its extension with a syntactic 
transformation language. §6 shows applications of the se- 
mantic transformation language (described in §5) for per- 
forming transformations on standard data types. 

2. MOTIVATING EXAMPLE 

Consider the following example from an Excel heljvforum. 

Example 1. A shopkeeper wanted to compute the selling 
price of an item (Output) from its name (Input Vi) and 
selling date (Input V2) using the MarkupRec and CostRec 

tables as shown m Figure 1. The selling price of an item is 
computed by adding its purchase price (for the corresponding 
month) to its markup charges, which in turn is calculated by 
multiplying the markup percentage by the purchase price. 

The user expresses her intent by giving a couple of exam- 
ples (i.e., the first two rows). Our tool then automates the 



repetitive task (i.e., fills in the bold entries). We highlight 
below the key technical challenges involved. 

Expressive transformation language. The transforma- 
tion/program inferred by our system for automating the 

repetitive task involves both lookup and syntactic opera- 
tions. In particular, note that we need lookup operations 
for (i) obtaining the markup percentage from an item name 
in MarkupRec table (Stroller — > 30%) and (ii) for obtaining 
the purchase price of item in CostRec table after perform- 
ing a join operation between the two tables on the item 
Id column (Stroller, 12/2010 $145.67). Observe that the 
string 12/2010 used for performing the second lookup is ob- 
tained by performing a syntactic transformation (namely, a 
substring operation) on the input string 10/12/2010. After 
performing the lookups, we need a syntactic transforma- 
tion (namely, a concatenate operation) to concatenate the 
lookup outputs with constant strings like + , . , * in a par- 
ticular order to generate the final output string. We present 
an expressive transformation language that combines lookup 
and syntactic transformations in a nested manner. 

Succinct representation & efficient computation of large 
number of consistent transformations. The number of 
expressions in our expressive transformation language that 
are consistent with a given input-output example can poten- 
tially be very large. For example, for the first input-output 
example (Stroller, 10/12/2012 $145.67+0.30*145.67), 
there are a large number of transformations that can gen- 
erate the output string. In general, every substring in the 
output string can potentially be either a constant string, or 
a substring of an input string, or the result of a lookup op- 
eration. For example, the substring 30 in the output string 
can either be a constant string or a string obtained by per- 
forming a lookup operation in the MarkupRec or CostRec 
table. Some of the possible lookup transformations to ob- 
tain the string 30 include selecting the Markup column entry 
in the MaxkupRec table where the item Name in the corre- 
sponding row is either one of the constant strings Stroller 
or Aspirator, or it matches the input string vi. Another 
valid lookup transformation is to select the last two charac- 
ters from the item Id column (S30) in MarkupRec or CostRec 
table with various ways to select the first row by constrain- 
ing the item Name or Date columns respectively. We thus 
have a large number of possible transformations for each 
substring of the output string and explicit enumeration of 
all such choices becomes infeasible. A key technical contri- 
bution of this paper is a data structure that can succinctly 
represent an exponential rmmber of such transformations in 
polynomial space, and an algorithm that can compute such 
transformations in polynomial time. The key idea is to share 
common sub-expressions and compute/maintain choices for 
independent sub-expressions independently. 

Ranking. Our synthesis algorithm learns the set of all con- 
sistent transformations for each example and then intersects 
these sets to obtain the common transformations. The num- 
ber of examples required to converge to the desired trans- 
formation may be large. To enable learning of the desired 
transformation from very few examples, we perform a rank- 
ing of these learned transformations that gives preference to 
transformations that are smaller (Occam's razor principle) 
and that use fewer constants (to enforce generalization). 
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3. OVERVffiW 

We describe our expressive transformation language and 
associated data structures and algorithms for inductive syn- 
thesis in two steps (§4 and §5). In this section, we introduce 
the common formalism and our user interaction model (as 
derived from recent work on inductive synthesis [8, 9]). 

3.1 Formalism 

Transformation language L. The first step in inductive 
synthesis is to define a domain specific language that is ex- 
pressive enough to capture several real-world tasks, but at 
the same time is restrictive enough to enable efficient learn- 
ing from input-output examples. In this paper, we introduce 
a string transformation expression language L, which con- 
tains expressions e that map an input state a, which holds 
values for m string variables «i, .., Vm (denoting the multiple 
input columns in a spreadsheet), to an output string s. 

e : (String x . . . x String) -)■ String 
This formalism can also be used for string processing tasks 
that require generating a tuple of n strings as output by 
solving n independent problems. We characterize an ex- 
pression language L with the following components: (a) a 
set of grammar rules R, (b) a top-level symbol e, which is a 
uniquely distinguished non-terminal symbol occurring in R. 
§4.1 and §5.1 describe two examples of such a language. 

Data structure Dfor set of expressions in L. The num- 
ber of transformation expressions that axe consistent with a 
given set of input-output examples can be huge. We define 

a data structure D to succinctly represent such a large set 
of expressions. We describe D itself using a set of gram- 
mar rules R with top-level symbol e. §4.2 and §5.2 de- 
scribe examples of such a data structure that uses a top-level 
graph/DAG representation respectively. 

Synthesis algorithm: GenerateStr and intersect. The 
inductive synthesis algorithm Synthesize for an expression 
language L learns the set of expressions in L (represented 

using data structure D) that are consistent with a given set 
of input-output examples. Our synthesis algorithm consists 
of the following two procedures: 

• The GenerateStr procedure for computing the set of all 

expressions (represented using data structure D) that are 
consistent with a given input-output example. 

• The Intersect procedure for intersecting two sets of ex- 
pressions (represented using data structure D). We de- 
scribe Intersect procedure also using a set of rules. 

Definition 1. (Soundness/k-cornpleteness of Generate- 
Str) Let e = GenerateStr{a, s) . We say that GenerateStr 
procedure is sound if all expressions m e are consistent with 
the input-output example (cr, s). We say that GenerateStr 
procedure is k-complete if e includes all expressions with at 
most k recursive sub- expressions that are consistent with the 
input-output example {cr,s). 

Definition 2. (Soundness/Completeness of Intersect) 
Let e" = Intersect{e,e'). We say that Intersect is sound 
and complete iff e" includes all expressions that belong to 
both e and e' . 

§4.3 and §5.3 give examples of GenerateStr/Intersect 
procedures that are sound and k-complete/complete and 
that have polynomial time complexity. 



The synthesis algorithm Synthesize involves invoking the 
GenerateStr procedure on each input-output example, and 
intersecting the results using the Intersect procedure: 
Synthesize ((ai, si), . . . , {a„, s„)) 

1 P := GenerateStr ((Ti, si) ; 

2 for i = 2 to ri: 

3 P' := GenerateStr(a-i, Si) ; P := Intersect(P, P') ; 

4 return P; 

Ranking. An expressive domain-specific language L for in- 
ductive synthesis can often require a large number of ex- 
amples to learn the intended transformation. We address 

this problem by developing a ranking scheme that can be 
used to rank the possibly large number of transformation ex- 
pressions that are consistent with a small number of input- 
output examples. This ranking scheme is inspired by the 
Occam's razor principle, which states that a smaller and 
simpler explanation is usually the correct one. We define a 
comparison scheme between different expressions by defin- 
ing a partial order between them. Any partial order can be 
used that is consistent with the sharing used in data struc- 
ture D for succinct representation of sets of such expressions. 
In other words, the comparison of any two sub-expressions 
should be based only on attributes that are not shared at 
the level of the sub-expressions. This allows us to efficiently 
identify the top ranked expressions from among a set of ex- 
pressions represented using D. We give examples of such a 
ranking scheme in §4.4 and §5.4. Some of these choices are 
subjective, but our experiments illustrate that our ranking 
scheme works very effectively in practice: all of our bench- 
mark tasks required at most 3 input-output examples. 

3.2 User Interaction Model 

The user expresses her intent of the task using few input- 
output examples. The synthesizer based on the above for- 
malism then generates a ranked set of transformations that 
are consistent with those examples. We describe below some 
interaction techniques for automating the desired task or for 
generating a reusable transformation. 

The user can run the top-ranked synthesized transforma- 
tion on other inputs in the spreadsheet and check the gener- 
ated outputs. If any output is incorrect, the user can fix it 
and the synthesizer can repeat the learning process with the 
additional example that the user provided as a fix. Requir- 
ing the user to check the results of the synthesizer, especially 
on a large spreadsheet, can be cumbersome. To enable easier 
interaction, the synthesizer can run all transformations on 
each new input to generate a set of corresponding outputs 
for that input. The synthesizer can then highlight those in- 
puts (for user inspection) whose corresponding output set 
contains at least two outputs. The user can then focus their 
inspection on the highlighted inputs. Our prototype, imple- 
mented as an Excel add-in, supports this interaction model 
(which is also used in [8]). 

On the other hand, if the user wishes to learn a reusable 
script, then the synthesizer may present the set of synthe- 
sized transformations to the user. Either the top-k trans- 
formations can be shown, or the synthesizer can walk the 
user through the data structure that succinctly represents 
all consistent transformations and let the user select the de- 
sired one. The transformations can be shown using the sur- 
face syntax, or can be paraphrased in a natural language. 
The differences between different transformations can also 
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be explained by synthesizing a distinguishing input on which 
the transformations behave differently [11]. After receiving 
the correct output from the user on the distinguishing in- 
put, the synthesizer can repeat the learning process with 
this additional example. 

4. LOOKUP TRANSFORMATIONS 

In this section, we present a lookup transformation lan- 
guage Lt that can model transformations that involve map- 
ping a tuple of strings to another string using (possibly 
nested) lookup operations over a given database of relational 
tables. We first present the syntax and semantics of Lt and 
then present a data structure Dt to succinctly represent a 
large set of expressions in the language. We then present 
an efficient synthesis algorithm to learn a set of transforma- 
tions in Lt from a set of input-output examples, such that 
each of the learned transformations when run on the given 
inputs produces the corresponding outputs. 

4.1 Lookup Transformation Language Lt 

The syntax of our expression language Lt for lookup trans- 
formations is defined in Figure 3(a). An expression e* is 
either an input string variable Vi, or a select expression 
denoted by Select(C, T, 6), where T is a relational table 
identifier and C is a column identifier of the table. The 
Boolean condition b is an ordered conjunction of predicates 
pi A ... A pn where predicate p is an equality comparison 
between the content of some column of the table with a 
constant or an expression. We place a restriction on the 
columns present in these conditional predicates namely that 
these columns together constitute a candidate key of the ta- 
ble. The main idea behind this restriction is that we want 
to express queries that produce a single output as opposed 
to a set of outputs. The ordering of predicates results in an 
efficient intersection algorithm as described in §4.3. 

The language Lt has expected semantics. The expression 
Select(C, T, 6) returns the table entry T[C,r], where r is 
the only row that satisfies condition b (as condition b is over 
candidate keys of the table). If there exists no row r whose 
columns satisfy 6, the expression returns the empty string 
e. The predicate C = e* is evaluated for row r by first eval- 
uating the expression Ct and then comparing the returned 
string [etjcr with the string T[C, r]. 

We now present an example taken from an Excel help- 
forum that can be represented in Lt- 

Example 2. An Excel user was working on two tables: 
CustData and Sale. The user wanted to map names of cus- 
tomers to the selling price using address and street num- 
ber columns as common columns between the two tables and 

posted the example shown in Figure 2 on a help-forum. 

The transformation can be expressed in Lt as 
Select(Price, Sale, Addr = Select(Addr, CustData, 
Name = ui) A St = Select(St, CustData, Name = vi)). 
The surface syntax of Lt allows sharing of sub-expressions 

(which is the key principle used in data structure Dt de- 
scribed in §4.2). To appreciate this, consider the following 
example, which is also our running example in this section. 

Example 3. Consider m tables Ti to Tm, each contain- 
ing three columns Ci, C2, and C3 with Ci being the primary 
key. Suppose table Ti contains a row (si,Si+i,Si+2). Now 
given an input-output example si Sm, we want to com- 
pute all expressions in Lt that are consistent with it. 



Input VI 


Output 


Peter Shaw 


110 


Gary Lamb 


225 


Mike Henry 


2015 


Sean Riley 


495 



CustData 


Name 




St 


Scan Hiicy 


l:i2 


15111 


Peter Shaw 


24 


18th 


Mike Henry 


432 


18th 


Gary Lamb 


104 


12th 









Sale 


Addr 


St 


Date 


Price 


21 


ISlli 


5 21 


110 


104 


12th 


5/23 


225 


432 


18th 


5/20 


2015 


432 


15th 


5/24 


495 











Figure 2: A lookup transformation that requires 
joining two tables. 

Consider the case of m = 4. Let e = Select (C2, Ti, Ci = 
«i) that produces string S2. Two possible expressions in Lt 
to obtain output S4 from input s\ are: (i) ei = Select(C3, T2, 
Ci = e) (corresponding to path Si — >■ S2 — >■ S4) and (ii) 
e2 = Select(C2,r3,Ci = Select(C2, T2, Ci = e)) (corre- 
sponding to path si — ^ S2 — ^ S3 — ^ sa). The expressions 
ei and 62 share the common sub-expression e which corre- 
sponds to obtaining the intermediate string S2 . 

4.2 Data Structure for Set of Expressions in Lt 

The set of expressions in language Lt that are consistent 
with a given input-output example can be exponential in 
the number of reachable table entries. We represent this set 
succinctly using the data structure Dt, which is described 
in Figure 3(b). The data structure consists of a generalized 
expression et, generalized Boolean condition 6, and gener- 
alized predicate p (which respectively denote a set of Lt 
expressions, a set of Boolean conditions 6, and a set of pred- 
icates p). The formal semantics |.J of the data structure is 
described in Figure 3(c). The generalized expression Ct is 
represented using a tuple (77, r?*. Progs) where ?7 is a set of 
nodes containing a distinct target node rf (representing the 
output string), and Progs : r) 2^ maps each node r/ G fj 
to a set consisting of input variables Vi or generalized select 
expressions Select(C, T,B). A generalized select expression 
takes a set of generalized Boolean conditions bi as the last 
argument, where each bi corresponds to some candidate key 
of table T. A generalized conditional 6 is a conjunction of 
generalized predicates pi, where each pi is an equality com- 
parison of the j*^ column of the corresponding candidate key 
with a constant string s or some node fj or both. There are 
two key aspects of this data structure which are explained 
below using some worst-case examples. 

Use of intermediate nodes in 77 for sharing: Con- 
sider the problem in Example 3. The set of all transfor- 
mations in Lt that arc consistent with the example si — > Sm 
can be represented succinctly using our data structure as: 
{{rji,. . . , r;m},J7m, Progs}, where Progsf?],] = {Select(C2, 
Ti_i,{Ci = {si_i,r;i_i}}),Select(C3,r,_2,{Ci = {si-2, 
??,_2}})}, Progs[??i] = {vi}, andProgs[r?2] = {Select (C2, Ti, 
{Ci — {si,rii}})}. The node rji essentially corresponds to 
the string Si. Figure 4 illustrates how the various nodes 
can be reached or computed from one another. Let N(i) 
(icuote the rmmber of expressions represented succinctly by 
Progs[r;i]. We have N(i) = 2 + N(i-l) + N(i-2), imply- 
ing that N(i) = 0(2'). Observe how our data structure 
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Expression et := 
I 

Boolean Cond h := 
Predicate p := 
I 

(a) 



Select(C,T,6) 

: pi A ... A p„ 
-. C = s 
C = et 



et 

f 
B 
b 
P 



(?7,?7*, Progs) 

where Progs : f} ^ 2^ 
I Select(C,r,_B) 

pi A ... A p„ 

C = s\ C = r) 

C = {s,r,} 
(b) 



|(?7,r?*, Progs)! 

|Select(C,T, {bi}i)l 
Ipi A . . . A p„l 

lC = r)\ 

ic = {s,n}} 



{et|e«e[/l,/eProgs[7?*]} 

{Select(C,T,6) \b£ (bi}} 
{pi A... Ap„ Ipj e fell 

{C = s} 

{C = et I et e [ProgsMll 

[c = si u [C7 = nl 

(c) 



Figure 3: (a) The syntax of lookup transformation language Lt, (b) and (c) describe the syntcLX and semantics 
of data structure Dt for succinctly representing a set of expressions from language Lt. 



Figure 4: The reachability graph of nodes in Ex. 3. 

makes use of the set of nodes {rji, . . . , rim} to succinctly rep- 
resent 0(2™) transformations in 0{m) space. 

Exploiting CNF form of boolean conditions: The 
second key aspect of our representation is exploiting the 
CNF form of boolean conditions to succinctly represent a 
huge set b of conditionals. Consider a table T with n + 1 
columns Ci, . . . ,Cn+i, where the first n columns consti- 
tute a primary key of the table and the table contains an 
entry (si, S2, . . . , s„, t). Consider the input-output exam- 
ple {S1,S2,. . . ,Sm) ->■ t with Si = S2 = ■■■ = Smax(m,n)- 

The number of transformations that are consistent with the 
given input-output example are (m + 1)" because for in- 
dexing into each column d of the table, we have m + 1 
choices namely constant si and the input string variables 
vi,. . . ,Vm. This huge set of transformations can be repre- 
sented succinctly in 0{n+m) space using our data structure 
as ({r?i,??2},??2, Progs), where Progs[r;i] = {vi, . . . ,«,„}, and 

n 

Progs[?72] = {Select(C„+i,r,6)}, b = /\{Ci = {si,?7i}). 

Theorem 1 (Properties of data structure Dt). 

(a) The number of transformations in Lt that are consistent 
with a given example may be exponential in the number of 
reachable entries and number of columns in a candidate key. 

(b) However, the data structure Dt can represent these po- 
tentially exponential number of transformations in polyno- 
mial size in number of reachable entries, number of candi- 
date keys and number of columns in a candidate key. 

Proof of (a) follows from the two examples discussed above, 
while proof of (b) follows from Theorem 2(a). 

4.3 Synthesis Algorithm for Lt 

Procedure GenerateStrt 

The GenerateStrt procedure, shown in Figure 5(a), oper- 
ates by iteratively computing a set of nodes i) and updating 
two maps Progs and val in the loop at Line 7. Each node 
r) £ fj represents a string val(r;) that is present in some 
table entry. The inverse map val~^(a) returns the node 
corresponding to string a or if no such node exist. The 
map Progs associates every node 77 to a set of expressions 



(of depth at most steps), each of which evaluates to string 
val(?7) on the input state o. The key idea of the loop at 
Line 7 is to perform an iterative forward reachability anal- 
ysis of the string values that can be generated in a single 
step (i.e., using a single Select expression) from the string 
values computed in previous step, with the base case being 
the values of the input string variables. 

Each iteration of the loop at Line 7 results in considera- 
tion of expressions whose depth is one larger than the set of 
expressions considered in the previous step. The depth of 
the expressions in language Lt can be as much as the total 
number of entries in all of the relational tables combined. 
Since we have not observed any intended transformation 
that requires self-joins, we limit the depth consideration to 
a parameter k whose value we set to be equal to the rmmbcr 
of relational tables present in the spreadsheet. One might 
be tempted to use the predicate {s € fjW f)aiA = r?) as a 
termination condition for the loop. However, this has two 
issues. The first issue is that it may happen co-incidentally 
that the output string s is computable by a transformation 
of depth smaller than the depth of the intended transforma- 
tion on a given example, and in that case we would fail to 
discover the correct transformation. The other major issue 
is that it might also happen that the intended transforma- 
tion does not belong to the language Lt, in which case the 
search would fail, but only after consideration of all expres- 
sions whose depth is as large as the total number of entries 
in all relational tables combined together. 

The generalized boolean condition B is computed to be 
the set of all boolean conditions that uniquely identify row r 
in table T (Line 10). It considers the set of candidate keys of 
table T and for each column C' in a candidate key it learns 
the generalized predicate: C" = {TfC, r], val [r[C", r]]}. 

During the reachability computation, a node 77 can be 
reached through multiple paths and therefore the set of ex- 
pressions associated with the node (Progs [77]) needs to be 
updated accordingly. When a node is revisited, the algo- 
rithm computes the Select expression with updated set of 
boolean conditions B and adds it to the set (Line 15). 

We now briefly describe how the GenerateStrt procedure 
computes the set of expressions for each node in Exam- 
ple 3. It first creates a node rji such that Progs[77i] = {vi}, 
val(77i) = si and the frontier of reachable nodes is set as 
??diff = We use node rji to denote the node cor- 

responding to string Si such that val(r/i) = Si. The al- 
gorithm then finds that the table entry Ti [Ci , 1] is reach- 
able from node rji with the generalized boolean condition 
B = {Ci = {si,wi}}. The algorithm then makes the other 
column entries in the row, namely Ti[C2,l] and Ti[C3,l] 
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Intersectt((r^i,77LProgSi), = (771 x ^2, (j?^ t?2), Progs^j) 
(?72,??2,ProgS2)) 
where ProgSj2[(??i,''72)] = Intersectt(ProgSi [tji], ProgS2[??2]) 



Vi 



GenerateStrt (a : Input state, s: Output string) 

1 fj:=0; »7oid:=0; steps — 0; 

2 foreach input variable Vi: 

3 if ((r; := val-i(cr(i;0)) = -L) 

4 then { ?7 := NewNode(); 77 := U {77} ; 

5 val(r7) := a(«i);Progs[?7] := 0; } 

6 Progs[r7] ~ Progs[?7] U {vi}; 

7 while (steps++ < k A fjoid 7^ ??) 

8 ?7diff ~ fj — fjoia; fjoid ~ v; 

9 foreach table T, col C, tovi r s.t. 

T[C,r\ = val(r?) for some r) e ^diff 
10 B:={ A {C' = {T[C',rU^-\T[C\r])})\ 

C'ecKay 

cKey e CandidateKeys(T)}; 

iJ foreach column C of table T s.t. C ^ C : 

12 if ((r?-val-i(T[C',r])) = ±) 

i5 then ■[ 77 := NewNode(); 77 ~ J7 U {77} ; 

14 val(77) :=r[C",r];Progs[77] :=0; > 

15 ProgsM := Progs[77] U {Select(C", T, B)} ; 

16 return (77, val~^(s), Progs) ; 

(a) 

Figure 5: (a) The GenerateStrt procedure for generating the set of all expressions (of depth at most k) 
in language Lt that are consistent with a given input-output example, (b) The Intersect* procedure for 
intersecting sets of expressions from language Lt. The Intersect* procedure returns in all other cases. 



Intersect* (t;^, Vi) 
Intersectt(Select(C, T, B), 
Select(C,T,S')) 
where B" = Intersectt(i3, S') 
Intersectt({6i}i, {6i}i) 
Intersect* ({pi}j, {Pi}i) 
Intersect* (C = s,C = s) 
Intersect* (C — rii,C — 772) 
Intersectt(C = {5,771}, 
C = {5,772}) 

Intersect*(C = {si,77i}, = C = {(r?i, 772)}, if si ^ S2 

C = {52, 772}) 

(b) 



Select(C, T,B" 



{Intersect*(6i, 6-)}i 
{Intersect*(pi,pi)}i 
C = s 

C = (771,772) 

C = {3,(771,772)} 



also reachable and creates nodes 772 and 773 corresponding 
to them, and then sets Progs[?72] = {Select(C2, Ti, B)} 
and Progs[?73] = {Select(C3, Ti, B)}. In the next iter- 
ation of loop, the frontier of reachable set is updated to 
»7diff = {ri2,rji} and the nodes that are reachable from this 
set are next computed. The algorithm finds that table entry 
22 [Ci, 1] is reachable from node 772 and thereby makes nodes 
773 and 774 reachable as well with corresponding Select ex- 
pressions. Similarly, the nodes T/4 and 775 become reachable 
from 773. In this manner, the algorithm keeps computing the 
set of reachable table entries iteratively until k iterations, 
where k is set to the number of relational tables n. 

Procedure intersect* 

The Intersect* procedure takes two sets of expressions in 
Lt as input and computes the set of expressions that are 
common to both the sets. (Both input and output sets are 
represented using the data structure Dt ) Figure 5(b) de- 
scribes the Intersect* procedure for intersecting the sets of 
Lt expressions using a set of rules that are pattern matched 
for execution. For intersecting two expressions (771 , 77^; , Progs-^ ) 
and {q'z. rfzjVicogs.^) ; we take the cross product of the set of 
nodes to get the new set of nodes 7712 = (r;i x 772) with the 
target node (771,772), and compute the new ProgSjj map for 
each node (771,772) € 7712 by intersecting their corresponding 
maps ProgSj^[77i] and ProgS2[772] respectively. The intersect 
rule for two select expressions requires the column name 
and table id to be the same and intersects the conditionals 
recursively. The candidate keys bi as well as each column 
conditional p in a candidate key are intersected individually 
maintaining their corresponding orderings. 

Theorem 2 (Synthesis Algorithm Properties). 

(a) The procedure GenerateStrt is sound and k-cornplete. 
The computational complexity of GenerateStrt procedure 
(and hence the size of the data structure constructed by it) 
is 0{t^ p m) where t is the number of reachable strings in 
k iterations, p is the maximum number of candidate keys in 



any table and m is the maximum number of columns in any 
candidate key. (b) The procedure Intersectt is sound and 
complete. The computational complexity of Intersectt pro- 
cedure ( and hence the size of the data structure returned by 
it) is 0{d^), where d is the size of the input data structures. 

The proof of Theorem 2 is given in [16]. 

4.4 Ranking 

In this section, we define a partial order between expres- 
sions in Lt that we use for ranking of these expressions. We 
prefer expressions of smaller depth (fewer nested chains of 
Select expressions). We prefer lookup expressions that use 
distinct tables for join queries (the most common scenario 
for end-users) as opposed to expressions involving self-joins. 
We prefer conditionals that consist of fewer predicates and 
prefer predicates that involve comparing columns with other 
table entries or input variables (as opposed to comparing 
columns with constant strings). 

5. SEMANTIC TRANSFORMATIONS 

We now present an extension of the lookup transformation 
language L* (described in §4) with a syntactic transforma- 
tion language Lg (from [8]). This extended language L„, 
also referred to as semantic transformation language, adds 
two key capabilities to L*: (i) It allows for lookup trans- 
formations that involve performing syntactic manipulations 
(such as substring, concatenation, etc.) on strings before 
using them to perform lookups, and (ii) It allows for per- 
forming syntactic manipulations on lookup outputs (which 
can then be used for performing further lookups or for gen- 
crating the output string). This extension, as we show in 
§6, also enables us to model transformations on strings rep- 
resenting standard data types such as date, time, etc. We 
first describe a syntactic transformation language. 

Syntactic Transformation Language Ls (Background). 
Gulwani [8] introduced an expression language for perform- 
ing syntactic string transformations. We reproduce here a 
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small subset of (the rules of) that language and call it Ls 
(with Cs being the top-level symbol). 

Cs := Concatenate (fi, . . . ,f„) | f 
Atomic expr f := ConstStr(s) | Vi \ SubStr(t;i, p^, pj) 

Position p := A: I pos(ri,r2,c) 
Integer expr c := k\k\w + k2 
Regular expr r := e | r | TokenSeq(ri, . . . , Tn) 

The formal semantics of Ls can be found in [8]. For com- 
pleteness, we briefly describe some key aspects of this lan- 
guage. The top-level expression es is either an atomic ex- 
pression f or is obtained by concatenating atomic expressions 
fi,. . ..in using the Concatenate constructor. Each atomic 
expression f can either be a constant string ConstStr(s). an 
input string variable Vi, or a substring of some input string 
Vi. The substring expression SubStr(wj,pj,p2) is defined 
partly by two position expressions pj and Pj, each of which 
implicitly refers to the (subject) string Vi and must evaluate 
to a position within the string Vi. (A string with I char- 
acters has £ + 1 positions, numbered from to t starting 
from left.) SubStr(wi, pj^, pj) is the substring of string Vi in 
between positions pj and pj. A position expression repre- 
sented by a non-negative constant k denotes the k*^ posi- 
tion in the string. For a negative constant k, it denotes the 
(£ + 1 + fc)*'' position in the string, where H — Length(s). A 
position expression pos(ri,r2,c) evaluates to a position t in 
the subject string s such that regular expression ri matches 
some sufflx of s[0 : i], and 12 matches some prefix of s[t : £], 
where i = Length.(s). Furthermore, if c is positive (nega^ 
tive), then t is the |c|*'' such match starting from the left 
side (right side). We use the expression s[t\ : t'2\ to denote 
the substring of s between positions ti and t2. We use no- 
tation SubStr2(vi, r, c) as an abbreviation to denote the c"' 
occurrence of r in Vi, i.e., SubStr(vi, pos(e, r, c), pos(r, e, c)). 

A regular expression r is either a token r, a token sequence 
TokenSeq(ri, Tn) or e (which matches the empty string). 
The tokens r range over some finite (but easily extensible) 
set and typically correspond to character classes and spe- 
cial characters. For example, tokens UpperTok, NumTok, and 
AlphTok match a nonempty sequence of uppercase alpha- 
betic characters, numeric digits, and alphanumeric charac- 
ters respectively. DecNumTok matches a nonempty sequence 
of numeric digits and/or decimal point. SlashTok matches 
the slash character. Special tokens StartTok and EndTok 
match the beginning and end of a string respectively. 

Example 4. An Excel user wanted to transform names 
to a format where the last name is followed by the initial 
letter of the first name, e.g., "Alan Turing" — >■ "Turing 
A". An expression in Ls that can perform this transfor- 
mation is: Concatenate{fj^, Const Str{" "),/2) where f^ = 
Sul>Str2{vi,AlphTok,2) and f^ = SubStr2{vi, UpperTok,!). 
This expression constructs the output sting by concatenating 
the 2"'' word of input string, the constant string whitespace, 
and the 1"* capital letter in input string. 

For more details on the syntactic transformation language 
Ls, see [8]. Now we present the extended language L„. 

5.1 Semantic Transformation Language L„ 

Let Rt and Rs denote the set of grammar rules of lan- 
guages Lt and Ls respectively. We subscript each non- 
terminal in the two languages with t and s for disambiguat- 
ing the names of non-terminals in the extended language. 



For example, fs denotes the atomic expression f of the syn- 
tactic transformation language Ls. The expression grammar 
of the extended language Lu consists of rules RtURs in which 
the following rules are modified (with modifications shown 
in bold), and with Cs as the top-level symbol. 



Atomic expr fs 
Predicate pt 

Cs 



= ConstStr(s) | et | SubStr(et, p^^ , p^ J 

= C = s\C = e^ 

= Concatenate(fsi , . . . , fs„ ) | f 3 



The top-level expression of the extended language is ei- 
ther an atomic expression fs or a Concatenate operation 
on a sequence of atomic expressions fs^, as before. How- 
ever, the atomic expression fs is updated to consist of a 
lookup expression et or its substring SubStr(et, p^^ , p^^) (as 
opposed to only an input variable Vi or its substring). This 
lets the language model transformations that perform syn- 
tactic manipulations over table lookup outputs. The other 
modification is in the predicate expression pt of language 
Lt, where we modify the conditional expression C = et to 
C = Cg. This enables the language to model lookup trans- 
formations that perform syntactic manipulations on strings 
before performing the lookup. The updated rules have ex- 
pected semantics and can be defined in a similar fashion as 
the semantics of rules in Lt and Ls. We now illustrate the 
expressiveness of the extended language using few examples. 

The transformation in Example 1 can be represented in Lu 
as: Concatenate(fi , ConstStr (' '+0 . ") , f2 , ConstStr (' '*' ') , fs ) , 
fi = Select(Price, CostRec, Id = f4 A Date = fs), 
f4 = Select(ld, MarkupRec, Name = vi), 
fs = SubStr(«2, pos(SlashTok, e, 1), pos(EndTok, e, 1)), 
f2 = SubStr2(f6, NumTok, 1), fs = SubStr2(fi, DecNumTok, 1), 
fg = Select(Markup, MarkupRec, Name — Vi). 

The expression f4 looks up the Id of the item in vi from 
the MarkupRec table and expression fs generates a substring 
of the date in V2, which are then used to look up the Price of 
the item from the CostRec table (fi). The expression fg looks 
up the Mcirkup percentage of the item from the MarkupRec 
table and f2 generates a substring of this lookup value by ex- 
tracting the first numeric token (thus removing the 7. sign). 
Similarly, the expression fs generates a substring of fi, re- 
moving the $ sign. Finally, the top-level expression concate- 
nates fi, f2, and fs with constant strings "+0." and 

Example 5. Indexing with concatenated strings: 

A bike merchant maintained an inventory of BikePrices ta- 
ble, and wanted to compute the price quote table by perform- 
ing lookup of bike Price after concatenating the bike name 
(vi) and the engine cc value (V2) as shown in Figure 6. 



Input VI 


Input V2 


Output 


Honda 


125 


11,500 


Ducati 


100 


10,000 


Honda 


250 


19,000 


Ducati 


250 


18,000 



BikePrices 


Bike 


Price 


DucatilOO 


10,000 


Ducatil25 


12,500 


Ducati250 


18,000 


Hondal25 


11,500 


Honda250 


19,000 







Figure 6: A lookup transformation that requires 
concatenating input strings before performing selec- 
tion from a table. 

The desired transformation can be expressed in the ex- 
tended language as: Select(Price, BikePrices, Bike = e^) 
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where Cs = Concatenate(Di , V2). The expression concate- 
nates the two input string variables vi and V2 , which is then 
used to perform the lookup in the BikePrices table. 

Example 6. Concatenating table outputs: A user 

had a series of three company codes in a column and wanted 
to expand them into the corresponding series of company 
names using a table Comp that mapped company codes to the 
company names as shown in Figure 7. 







Comp 






Id 


Name 


Input vi 


Output 


cl 


Microsoft 


c4 c3 cl 


Facebook Apple Microsoft 


c2 


Google 


c2 c5 c6 


Google IBM Xerox 


c3 


Apple 


cl c5 c4 


Microsoft IBM Facebook 


c4 


Facebook 


c2 c3 c4 


Google Apple Facebook 


c5 


IBM 






c6 


Xerox 











Figure 7: A nested syntactic and lookup transfor- 
mation. It requires concatenating results of multi- 
ple lookup transformations, each of which involves a 
selection operation that indexes on some substring 
of the input. 



This transformation is expressed in Lu as: 
Concatenate(fi, ConstStr(" "), f2, ConstStr( " "),f3), where 
fi s Select(Name,Comp, Id = SubStr2(vi, AlphTok, 1)), f2 = 
Select(Naine,Comp, Id = SubStr2(t;i, AlphTok, 2)) and is = 
Select(Name,Comp, Id = SubStr2(vi, AlphTok, 3)). The ex- 
pressions fi, f2, and fa extract the first, second, and third 
words from the input string respectively, which are then 
used for performing the tabic lookups and the results arc 
concatenated with whitcspaccs to obtain the output string. 

5.2 Data Structure for Set of Expressions in Lu 

Let Rt and Rs denote the set of grammar rules for the data 

structures that represent set of expressions in languages Lt 
and Ls respectively (See [8] for description of Rs). We con- 
struct the grammar rules for the data structure that rep- 
resents set of expressions in the extended language Lu by 
taking the union of the two rules Rt U Rs and modifying 
some rules as follows: 

fs := ConstStr(s) I et I SubStr(et,Psi,PsJ 
pt ■■= C = s I C = is 

is := Dag(a, a% a*,!, W) | where W : i ^ 2^' 

The most interesting aspect of this data structure is the 
T)a.g{a, a", a*, ^, W) construct, which succinctly represents a 
set of Concatenate expressions in L„ using a dag (directed 
acyclic graph), where fi is a set of nodes containing two 
distinct source and target nodes a" and a', ^ is a set of 
edges over nodes in a that induces a dag, and W maps each 
edge in ^ to a set of atomic expressions. The semantics [.] 
of the Dag constructor is: 

[Dag(a, a%a\i, W)} = {Concatenate(f.i , . . . , f.„) | 

fsi € lW(^i)l,^i, ..,^n € ^ form a path between a" and a*} 

The set of all Concatenate expressions represented by the 
Dag constructor includes exactly those whose ordered argu- 
ments belong to the corresponding edges on any path from 
a" to a* . This dag representation is similar to the represen- 
tation of string expressions in [8]. However, in our case, the 



edges of the dag consist of more sophisticated (substrings 
of) lookup expressions, whose predicates can in-turn be rep- 
resented using nested-dags. 

Consider Example 6, where the input string is "c4 c3 cl" 
and the output string is "Facebook Apple Microsoft" (of 
length 24) . The dag G that represents the set of all transfor- 
mations consistent with this input-output pair is shown in 
Figure 8. For better readability, wc only show some of the 
relevant nodes and edges of the dag G. The edge from node 
to node 8 corresponds to all expressions ei that generate 
the string Facebook. One of the lookup transformations, 
Select(Name, Comp, Id — /i), in ei requires syntactic trans- 
formations /i to extract substring cl from the input string, 
where /i is itself represented as a nested-dag as shown in 
the figure. The edges for expressions es and es also consist 
of similar nested-dags. 

Theorem 3 (Properties of data structure Du). 

(a) The number of transformations in Lu that are consistent 
with a given input-output example may be exponential in the 
number of reachable entries, the number of columns in a pri- 
mary key, and the length of the largest reachable string. 

(b) However, the data structure Du can represent these po- 
tentially exponential number of transformations in size poly- 
nomial in the number of reachable entries, the number of 
primary keys, the number of columns in a primary key, and 
the length of the largest reachable string. 

(a) Proof of (a) follows from Theorem 1(a). The number 
of transformations can also be exponential in the length of 
the largest reachable string as we are using GenerateStrg 
procedure for checking reachability which has this worst case 
complexity, (b) Wc show that the size of the data structure 
generated is 0(t p m f) in Theorem 4(a). 

5.3 Synthesis Algorithm for Lu 

Procedure GenerateStr„ 

Recall that the GenerateStrt procedure for language Lt 
(§4.3) performs reachability on table entries based on ex- 
act string matches (T[C, r] — r/). The key idea in case of the 
language Lu is to perform a more relaxed reachability on ta- 
ble entries taking into account the possibility of performing 
syntactic manipulations on previously reachable strings. 

We first define a GenerateStrJ procedure by making two 
modifications to the GenerateStr* procedure. First, we 
replace the condition "T[C, r] = val(r7)" in GenerateStrt 
(Line 9 in Figure 5(a)) by the condition "GenerateStrs(cr U 
?7,T[C, r]) contains any expression that uses a variable from 
a U fj". We use the notation a U ?7 to denote a state that ex- 
tends a and maps to val(77) for all rj £ fj. The GenerateStrJ 
procedure marks a table entry as reachable if it can be com- 
puted using the GenerateStr^ procedure (from [8]) on pre- 
viously reachable strings. The GenerateStr^ procedure can 
perform concatenation of constant strings and substrings of 
previously reachable strings (fj). We add an additional check 
that the expressions synthesized by GenerateStr^ contains 
a variable from cr U 77 to avoid expressions containing only 
constant string expressions. For our experiments, we en- 
force an even stronger restriction that there exists a string 
77 G (a U 77) such that either T[C,r] is substring of r; or 77 
is a substring of T[C,r]. This provides efficiency without 
any practical loss of precision. The second modification is 
in Line 10 in Figure 5(a), where we replace the generalized 
predicate with C' = {GenerateStrs(o- U 77, T[C", r])}. 
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ei 62 63 64 65 
<8) <9)— ^^5) 4!) W 

fi = Dag({ao, tti, 02}, 00,02, {{ao, oi), (0102), (0002)}, W2}) 
where VF2((ao,oi)) = ee, W2((ai,a2)) = ey, W^2((ao,a2)) = es 

G = Dag({0, . . . , 24), 0, 24, {(i, j> | < i < j < 24}, Wi) 
where lVi({0,8)) = ei,W'2({8, 9)) = e2,H^2((9, 14)) = £3, ... 



ei s {Select(Name, Comp, Id = /i), 

ConstStr("Facebook"), . . .} 

62 s {ConstStr(" "), . . .} 

63 = {Select (Name, Comp, Id = /2), 

ConstStr("Apple"), . . .} 

64 = {ConstStr(" "),...} 

65 = {Select(Name, Comp, Id = /s), 

ConstStr( "Microsoft"), . . .} 

66 = {ConstStr("c"), . . .} 

67 = {ConstStr("l" ),...} 

68 = {SubStr2(vi, AlphTok, 1), . . .} 



Figure 8: A partial Dag representation of the set of expressions in Example 6. 



The GenerateStr^ procedure for the extended language 
Lu can now be defined as; 

GenerateStr^ (cr : Input state, s: Output string) 

1 (r;, 77*, Progs) = GenerateStrj(a, s); 

2 return GenerateStrs(cr U 77, s) ; 

The GenerateStru procedure first constructs the set of all 
reachable table entries (77) from the set of input strings 
in a using the GenerateStrJ procedure. It then uses the 
GenerateStrs procedure to construct the Dag for generating 
the output string s from the set of strings that includes val- 
ues of input variables (in state cr) as well as the reachable 
table entries (represented by fj). 

Consider the first input-output pair ("c4 c3 cl", "Face- 
book Apple Microsoft") in Example 6. The GenerateStr„ 
algorithm first uses the GenerateStrJ procedure to compute 
the set of reachable table entries. The GenerateStrJ proce- 
dure finds that the table entry T[Id,4] = c4 is reachable 
from the input string "c4 c3 cl" as there exists an expres- 
sion SubStr2(ui, AlphTok, 1) that can generate the string c4. 
It then adds the string Facebook from row 4 to the reachable 
set 77. Similarly, table entries c3, Apple, cl, and Microsoft 
are also added to 77. It then uses the GenerateStrg pro- 
cedure to construct a dag for generating the output string 
"Facebook Apple Microsoft" from the set {"c4 c3 cl", 
cl, c2, c3, Facebook, Apple, Microsoft} (aU77). It first 
creates a Dag of 25 nodes (a — {0, . . . , 24}) with as the 
source node and 24 as the target node. The algorithm then 
assigns a set of expressions to each edge {i, j) that can gener- 
ate the substring s[i,j]. For example, the algorithm adds 
the expression that selects the string Facebook from <t U 77 
to the edge (0,8); the expression in turn corresponds to a 
lookup transformation with nested sub-dags as shown in Fig- 
ure 8. Similarly, it adds the expression 62 that generates a 
whitespace to the edge (8,9), and so on. 

Procedure intersect^ 

The Intersect^ procedure for the extended language 
consists of the union of rules of Intersectt and Intersect^ 
procedures along with the following four additional rules. 

Intersectu(et, ej) = Intersectt(et, ej) 

Intersectu(C = Ss, C = 63) = (C = Intersects(es, e^)) 
Intersect„(SubStr(et, p^^ , p^ J, SubStr(ei, p'^^ , p'^ J) = 
SubStr(lntersectt(et, ej), IntersectPos(pg^ , p'^j^), 
IntersectPos(p^2 , p'^^ )) 



Intersect„(Dag(ai, Qi, ai,|i, Wi),Dag(a2, 03, a2,^2,W2)) 

= Dag(ai X Q2, (qi, 02), ("i, a2),fi2, IV12), where 

|i2 = {((ai,a2), (ai,a2)) | (ai,a'i) £ |i,(a2,a2) £ I2} 

and I¥i2(((ai, 02), (0:1,02))) = {Intersects (f,f') | 

f e m((Qi,Q'i)),f e W'2((a2,a2))} 

The intersect rule for Dag intersects the two dags in a man- 
ner similar to the intersection of two finite state automatons. 
The new mapping W\2 is computed by performing intersec- 
tion of the expressions on the two corresponding edges of 
the dags. (Rules for Intersects are defined in [8].) 

Theorem 4 (Synthesis Algorithm Properties). 

(a) The GenerateStru procedure is sound and k-complete. 
The computational complexity of GenerateStru procedure is 
0{t^ p rn £*) (assuming 0{l^) complexity for the new check 
on Line 9), and the size of the data structure constructed by 
it is 0{t'^ p m where t is the number of reachable strings 
in k iterations, p is the maximum number of candidate keys 
of any table, rn is the maximum number of columns in any 
candidate key, and I is the length of the longest reachable 
string, (h) The Intersectu procedure is sound and com- 
plete. The computational complexity of Intersectu (and 
hence the size of the data structure returned by it) is 0{d^), 
where d is the size of the input data structures. 

The proof of Th. 4 and more details about 0{l^) assump- 
tion for check on Line 9 are given in [16]. The worst-case 
quadratic blowup in the size of the output returned by the 
Intersect procedure doesn't happen in practice (as we re- 
port in §7) making the synthesis algorithm very efficient. 

5.4 Ranking 

The partial orders of ranking schemes of Lt and Ls are 
also used to rank expressions in Lu. In addition, we define 
some additional partial orders for expressions in Lu. We 
prefer lookup expressions that match longer strings in table 
entries for indexing than the ones that match shorter strings. 
We prefer lookup expressions with fewer constant string ex- 
pressions and ones that generate longer output strings. 

6. STANDARD DATA TYPES 

The language L„ can also model a rich class of transfor- 
mations on strings that represent standard data types such 
as date, time, phone immbcrs, currency, or addresses. Ma- 
nipulation of these data types typically requires some back- 
ground knowledge associated with these data types. For 
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Figure 9: User interface of our programming-by- 
example Excel add-in. (a) and (b) are the screen- 
shots before and after clicking the Apply button. 

example, for dates we have the knowledge that month 2 
corresponds to the string February, or for phone numbers 
we have the knowledge that 90 is the ISD code for Turkey. 
This background knowledge can be encoded as a set of rela- 
tional tables in our framework (once and for all). This allows 
the synthesis algorithm for Lu to learn transformations over 
strings representing these data types. We now present some 
examples from Excel help-forums where users were strug- 
gling with performing manipulations over such strings. 

Example 7 (Time Manipulation). An Excel user 
needed to have spot times (shown in column 1 in Figure 9(a)) 
converted to the hh:mm AM/PM format (as shown in col- 
umn 2 in Figure 9(a)). The Excel user posted thts problem 
on a help-forum to which an expert responded by providing 
the following macro: TEXT(C1, "00 00 ")+0 
When we showed this example to a team of Excel experts in 
a live presentation, it drew a response: "There are 40 differ- 
ent ways of doing this!". When we asked them to describe 
any one of those ways, we got the response "I don't exactly 
remember how to do it. I will have to investigate". 

We encode the background knowledge concerning time in 
a table Time with three columns 24Hour,12Hour and AMPM, 
where the first column constitutes a primary key, and so does 
the combination of the second and third columns. The table 
is populated with 24 entries: {0,0, AM), . . ., (11, 11, AM), 
(12,12, PM), (13,1, PM), {23,11, PM). The desired 
transformation can be represented in our language as: 
Concatenate(Select(l2Hour, Time, 6i), ConstStr(" : "), 
SubStr(ui, -3, -1), ConstStr(" "), Select (AMPM, Time, bi) 
where &i = (24Hour = SubStr(i'i, pos(StartTok, e, 1), —3)). 
The SubStr expression in bi computes the substring of 
the input between the start and 3^"^ character from end, to 
compute the hour part of the time in column vi . This hour 
string is then used to perform lookup in table Time to com- 
pute its corresponding 12Hour format and AMPM value. These 
lookup strings are then concatenated with the minute part 
of the input string and constant strings : and whitespace. 

Example 8 (Date Manipulation). An Excel user 
wanted to convert dates from one format to another as shown 
in Figure 10, and the fixed set of hard-coded date formats 
supported by Excel 2010 do not match the input and output 
formats. Thus, the user solicited help on a forum. 



Input VI 


Output 


6-3-2008 


Jun 3rd, 2008 


3-26-2010 


Mar 26th, 2010 


8-1-2009 


Aug 1st, 2009 


9-24-2007 


Sep 24th, 2007 



Figure 10: Formatting dates using examples. 

We encode the background knowledge concerning months 
in a table Month with two columns MN and MW, where each of 
the columns constitutes a candidate key by itself. The table 
is populated with 12 entries: (1, January), . . ., (12, December). 
We also maintain a table DateOrd with two columns Num and 
Ord, where the first column constitutes a primary key. The 
table contains 31 entries (1, st), (2, nd), (3, rd), . . ., (31, st). 
The desired transformation is represented in L„ as : 
Concatenate(SubStr(Select(MW, Month, MN = ei), 
pos(StartTok, e, 1), 3), ConstStr(" "), 62, 
Select(Ord, DateOrd, Num = 62), ConstStr(", "),e3) 
where ei = SubStr2(i'i, NumTok, 1), e-i = SubStr2(wi, NumTok, 
2), and 63 = SubStr2(i;i, NumTok, 3). 

The expression concatenates the following strings: the 
string obtained by lookup of first number token of vi in table 
Month, the constant string whitespace, the second number 
token of v\ , the string obtained by lookup of second number 
token of v\ in table DateOrd, the constant string ", ", and 
the string corresponding to third number token of vi. 

Unfortunately, it is not possible to encode semantics of 
data-types with infinite domains using relational tables. One 
such data-type is numbers, which often entail rounding and 
formatting transformations [17]. 

7. EXPERIMENTS 

We have implemented the inductive synthesis algorithm 
for the transformation language in C# as an add-in for 
Microsoft Excel Spreadsheet system as shown in Figure 9. 
We hard-code a few useful relational tables of our own (such 
as the one that maps month numbers to month names), 
while also allowing the user to point to existing Excel tables 
to be used for performing the transformation. 

Benchmarks: We report experimental results on a set 
of 50 problems collected from several Excel help-forums and 
the Excel product team (including all problems described in 
this paper). Out of these 50 problems, 12 problems can be 
modeled in the lookup language Lt whereas the remaining 
38 of them require the extended language L„. The detailed 
description of these 50 problems can be found in [16]. 

Effectiveness of data structure Di,: We first present 
the statistics about the number of expressions in Lu that 
are consistent with the user-provided set of input-output 
examples for each benchmark problem in Figure 11(a). The 
figure shows that the number of such consistent expressions 
are very large and are typically in the range hom 10^" to 
10'^''. Figure 11(b) shows that the size of our data structure 
Du to represent this large number of expressions typically 
varies from 100 to 2000, where each terminal symbol in the 
grammar rules of the data structure contributes a unit size 
to the size of the data structure. 

Effectiveness of ranking: Use of a ranking scheme en- 
ables users to provide fewer input-output examples to auto- 
mate their repetitive task. Hence, the eff'ectiveness of our 
ranking scheme can be measured by the number of examples 
required for the intended program to be ranked as the top- 
most program. In our evaluation, all benchmark problems 
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Figure 11: (a) Number of expressions consistent with given i-o examples and (b) Size of data structure to 
represent all consistent expressions. 
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Figure 12: (a) Running time (in seconds) to learn 
the program and (b) Size of the data structure 
before and after performing Intersect^. 

required at most 3 examples to learn the transformation: 35 
benchmarks required 1 example, 13 benchmarks required 2 
examples and 2 benchmarks required 3 examples. 

Performance: We present the running time of our syn- 
thesis algorithm to learn the desired transformation for each 
benchmark problem in Figure 12(a) (sorted in increasing or- 
der). Note that 88% of benchmark problems finished in less 
than 1 second each and 96% of problems finished in less 
than 2 seconds each. The experiments were performed on a 
machine with Intel Core 17 1.87GHz CPU and 4GB RAM. 

Size of data structure after Intersection: Finally, we 
show empirically that the Intersect^ procedure does not 
cause a quadratic blowup in the size of the data structure 
for any of our benchmark problems. We compare the sizes 
of the data structure corresponding to the first input-output 
example and the data structure obtained after performing 
the Intersect^ procedure in Figure 12(b). As we can see 
the size of the data structure mostly decreases after inter- 
section and increases slightly in a few cases, but it is very 
far from a quadratic increase in its size. 

8. RELATED WORK 

Within the database literature, our work is most closely 
related to the problems of record matching^ learning schema 
matches and query synthesis. We have detailed some differ- 
ences below, but the most significant difference is that we 
put these concepts together. 

Record Matching: The task of syntactic manipulation 



performed before a lookup operation in our extended trans- 
formation language can be likened to the problem of record 
matching. Most of the prior work in this area [5, 12] has 
focused on designing appropriate similarity functions such 
as edit distance, jaccard similarity, cosine similarity, and 
HMM25. A basic limitation of most of them is that they 
have limited customizability. Arasu et.al. have proposed 
a customizable similarity measure that can either be user- 
programmed [1] or can be inferred from examples of match- 
ing textual records [2]. In both these cases, the underlying 
transformation rules only involve constant strings, e.g., US 
— )■ United States. Our record matching is also inferred from 
examples, but it involves generalized transformation rules 
consisting of syntactic operations such as regular expression 
matching, substring, and concatenate. 

Learning Complex Schema Matches: The problem of 
synthesizing semantic string manipulations is also related to 
the problem of finding complex semantic matches between 
the data stored in disparate sources. The iMAP system [4] 
finds the schema matches that involve concatenation of col- 
umn strings across different tables using a domain-oriented 
approach. Another approach by Warren and Tompa [19] 
learns the relationships that involve concatenation of col- 
umn substrings, but within a single table using a greedy 
approach. Our language-theoretic approach learns relation- 
ships that involve concatenation of column substrings across 
multiple database tables without using any domain knowl- 
edge about the column entries. 

Query Synthesis by Example: The view synthesis [3, 18] 
problem aims to find the most succinct and accurate query 
for a given database view instance. The high-level goal of 
this work is similar to that of our inductive synthesis algo- 
rithm for the lookup transformation language Lt, but there 
are some key differences: (i) View synthesis techniques infer 
a relation from a large representative example view, while 
we infer a transformation from a set of few example rows 
(which is a critical usability aspect for end-users). (11) View 
synthesis techniques infer the most likely relation, while our 
lookup synthesis algorithm infers a succinct representation 
of all possible hypotheses, which enables its extension to a 
synthesis algorithm for the language L„. (iii) The technique 
in [3] does not consider join or projection operations. 
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Text-editing Systems using Demonstrations and Ex- 
amples: QuickCode [8] is a programming by example sys- 
tem for automating syntactic string transformations in spread- 
sheets. It synthesizes programs with restricted form of regu- 
lar expressions, conditionals, and loops for performing syn- 
tactic string transformations. Our work leverages Quick- 
Code to perform semantic string transformations. 

Programming-by-denionstration systems for text-editing 
like SMARTedit [14] or simultaneous editing [15] require the 
user to provide a complete demonstration or trace, where the 
demonstration consists of a sequence of the editor state after 
each primitive action, really spelling out how to do the trans- 
formation, but on a given example. This is the main reason 
why such systems have not yet been very successful J13]. Our 
system is based on Programming by Example (as opposed 
to Programming by Demonstration) - it requires the user 
to provide only the final state (as opposed to also providing 
the intermediate states). 

The work in [10] describes a programming by example 
technology for learning layout transformations on tables. In 
contrast, this paper describes a learning algorithm for syn- 
thesizing string transformations based on table lookups. 

The systems described above are structured along the gen- 
eral formalism described in §3; however, this paper presents 
an instantiation to a novel domain of semantic string ma- 
nipulation based on a novel learning algorithm. None of the 
examples that we describe in this paper can be addressed 
by any of these systems (except Example 4, which can be 
handled by QuickCode) because they don't implement any 
reasoning about semantic data types. 

9. CONCLUSION AND FUTURE WORK 

Program synthesis is the task of synthesizing a program in 
some underlying language from specifications that can range 
from logical declarative specifications to examples or demon- 
strations. This topic has been studied extensively in Al and 
PL communities with the goal of easing the burden of algo- 
rithm designers or software developers. (See [7] for a recent 
survey.) Since program synthesis is a hard combinatorial 
problem, and these users write sophisticated programs, we 
have not yet been able to design robust tools that can pro- 
vide significant value to this class of users on a daily basis. 
As a result, existing program synthesis techniques have not 
yet found significant adoption in real world. In contrast, this 
paper targets end-users, whose needs are much simpler com- 
pared to those of software developers. This paper presents 
one such tool that is ready to be deployed for use by end- 
users in real world. More significantly, the impact potential 
is huge: 500 million spreadsheet users who struggle with 
spreadsheets on a daily basis! We believe that further re- 
search in this area of program synthesis for end-users can 
potentially bring a computing revolution by democratizing 
the ability to effectively use computational devices. 

In this paper, we have considered spreadsheet tables, which 
are typically small in size and lead to real-time performance 
of our learning algorithm. It would be interesting to con- 
sider improvements to our learning algorithm to allow for 
efficient handling of larger database tables (where the num- 
ber of reachable strings can be huge). There might also be 
an opportunity for designing new interaction models where 
users may point out the set of relevant sub-tables. 
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